Adding Wilson Score Confidence Interval Strategy #567

zeotuan · 2024-05-06T22:45:13Z

Fixes #563

Description of changes:

Update RetainCompletenessRule and FractionalCategoricalRangeRule to accept a configurable ConfidenceIntervalStrategy parameter
Add the Wilson Score Confidence Interval Strategy and the Wald Interval Strategy (the current default)
Make the Wilson Score Confidence Interval Strategy the new default method
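
For readers unfamiliar with the two methods, here is a minimal sketch of the standard Wald and Wilson score formulas for a binomial proportion. It is not the code from this PR: the object name `IntervalSketch`, the method names, and the hard-coded z-score are assumptions made for illustration. Only the `ConfidenceInterval` case class mirrors the diff.

```scala
object IntervalSketch {
  // Same shape as the case class shown in the PR diff.
  final case class ConfidenceInterval(lowerBound: Double, upperBound: Double)

  // z-score for a two-sided 95% interval, hard-coded so the sketch needs
  // no statistics library (assumption; a real implementation would derive
  // it from the requested confidence level).
  val z95: Double = 1.959964

  /** Wald (normal approximation) interval: pHat +/- z * sqrt(pHat * (1 - pHat) / n). */
  def wald(pHat: Double, n: Long, z: Double = z95): ConfidenceInterval = {
    require(n > 0 && pHat >= 0.0 && pHat <= 1.0, "need n > 0 and 0 <= pHat <= 1")
    val halfWidth = z * math.sqrt(pHat * (1 - pHat) / n)
    ConfidenceInterval(pHat - halfWidth, pHat + halfWidth)
  }

  /** Wilson score interval: shifts the center towards 0.5 and never collapses to zero width. */
  def wilson(pHat: Double, n: Long, z: Double = z95): ConfidenceInterval = {
    require(n > 0 && pHat >= 0.0 && pHat <= 1.0, "need n > 0 and 0 <= pHat <= 1")
    val zSqOverN = z * z / n
    val denominator = 1 + zSqOverN
    val center = (pHat + zSqOverN / 2) / denominator
    // zSqOverN / (4 * n) equals z^2 / (4 * n^2)
    val halfWidth = z / denominator * math.sqrt(pHat * (1 - pHat) / n + zSqOverN / (4 * n))
    ConfidenceInterval(center - halfWidth, center + halfWidth)
  }
}
```

With an observed proportion of exactly 1.0 (for example, a fully complete column in a small sample), the Wald interval has zero width, so its lower bound stays at 1.0, while the Wilson lower bound drops below 1.0. That difference is the kind of behavior change the choice of default affects.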

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

zeotuan · 2024-05-13T05:40:14Z

@rdsharma26 Hi, could you help review this PR?

zeotuan · 2024-05-13T05:45:15Z

src/main/scala/com/amazon/deequ/suggestions/rules/interval/ConfidenceIntervalStrategy.scala

+object ConfidenceIntervalStrategy {
+  val defaultConfidence = 0.95
+
+  case class ConfidenceInterval(lowerBound: Double, upperBound: Double)


The upperBound is currently calculated for each ConfidenceInterval as well, although at the moment we don't actually make use of it.

rdsharma26 · 2024-05-17T16:31:05Z

Thank you for the PR! @zeotuan
The changes look good to me. Can you fix the failing build?

One point I would like to discuss is making the Wilson Score Confidence Interval Strategy the new default. Could this potentially break backwards compatibility in terms of behavior? If so, should we stick to the Wald Interval Strategy as the default and update the documentation so that users of Deequ can choose which one they want?

zeotuan · 2024-05-20T02:05:13Z

> Thank you for the PR! @zeotuan The changes look good to me. Can you fix the failing build?
>
> One point I would like to discuss is making the Wilson Score Confidence Interval Strategy the new default. Could this potentially break backwards compatibility in terms of behavior? If so, should we stick to the Wald Interval Strategy as the default and update the documentation so that users of Deequ can choose which one they want?

I think making the Wilson Score Confidence Interval the default right now might introduce some surprising changes to existing data quality pipelines. I will add example usage documentation for this.
We could plan to change the default in a major version update and add a "deprecation" message for now so that users can migrate themselves.
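
A minimal sketch of the approach discussed here: the rule takes the interval strategy as a parameter, keeps Wald as the default so existing pipelines see no change, and lets users opt in to Wilson. `WilsonScoreIntervalStrategy` and `ConfidenceIntervalStrategy` are names from the PR; `WaldIntervalStrategy`, `CompletenessRuleSketch`, the `interval` method, and the hard-coded z-score are hypothetical names and simplifications for illustration, not Deequ's actual API.

```scala
object StrategyDefaultSketch {
  final case class ConfidenceInterval(lowerBound: Double, upperBound: Double)

  // Hypothetical simplified trait; the method name is an assumption.
  trait ConfidenceIntervalStrategy {
    def interval(pHat: Double, numRecords: Long): ConfidenceInterval
  }

  // z defaults to the 95% two-sided z-score (hard-coded assumption).
  final case class WaldIntervalStrategy(z: Double = 1.959964) extends ConfidenceIntervalStrategy {
    def interval(pHat: Double, numRecords: Long): ConfidenceInterval = {
      val halfWidth = z * math.sqrt(pHat * (1 - pHat) / numRecords)
      ConfidenceInterval(pHat - halfWidth, pHat + halfWidth)
    }
  }

  final case class WilsonScoreIntervalStrategy(z: Double = 1.959964) extends ConfidenceIntervalStrategy {
    def interval(pHat: Double, numRecords: Long): ConfidenceInterval = {
      val zSqOverN = z * z / numRecords
      val denominator = 1 + zSqOverN
      val center = (pHat + zSqOverN / 2) / denominator
      val halfWidth =
        z / denominator * math.sqrt(pHat * (1 - pHat) / numRecords + zSqOverN / (4 * numRecords))
      ConfidenceInterval(center - halfWidth, center + halfWidth)
    }
  }

  // Hypothetical rule: Wald stays the default, so callers that pass nothing
  // keep the previous behavior; Wilson is opt-in.
  final case class CompletenessRuleSketch(
      intervalStrategy: ConfidenceIntervalStrategy = WaldIntervalStrategy()) {

    // Only the lower bound feeds the suggested completeness threshold;
    // the upper bound is computed but unused, as noted in the review.
    def suggestedLowerBound(completeness: Double, numRecords: Long): Double =
      intervalStrategy.interval(completeness, numRecords).lowerBound
  }
}
```

A caller who wants the new behavior would construct the rule as `CompletenessRuleSketch(intervalStrategy = WilsonScoreIntervalStrategy())`, while `CompletenessRuleSketch()` keeps the Wald behavior.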

rdsharma26 · 2024-05-21T18:54:30Z

> We can potentially introduce plan to change the default in a major version update and add "deprecation" message for now so user migrate themselves.

This seems like a safe approach. Could we configure the default to be the Wald strategy in the following line?

 private val defaultIntervalStrategy: ConfidenceIntervalStrategy = WilsonScoreIntervalStrategy()

The build also failed due to:

error file=/home/runner/work/deequ/deequ/src/test/scala/com/amazon/deequ/suggestions/rules/interval/IntervalStrategyTest.scala message=expected start of definition, but was Token(VAL,val,1285,val)

rdsharma26 · 2024-05-22T19:02:23Z

src/main/scala/com/amazon/deequ/suggestions/rules/FractionalCategoricalRangeRule.scala

@@ -23,16 +23,17 @@ import com.amazon.deequ.metrics.DistributionValue
 import com.amazon.deequ.profiles.ColumnProfile
 import com.amazon.deequ.suggestions.ConstraintSuggestion
 import com.amazon.deequ.suggestions.ConstraintSuggestionWithValue
+import com.amazon.deequ.suggestions.rules.FractionalCategoricalRangeRule.defaultIntervalStrategy
+import com.amazon.deequ.suggestions.rules.interval.{ConfidenceIntervalStrategy, WilsonScoreIntervalStrategy}


nit: Could we avoid grouped imports and use one import per line?

Just to clarify, do we prefer separate imports, or a single grouped import with each name on its own line?

import com.amazon.deequ.suggestions.rules.interval.ConfidenceIntervalStrategy
import com.amazon.deequ.suggestions.rules.interval.WilsonScoreIntervalStrategy

or

import com.amazon.deequ.suggestions.rules.interval.{
  ConfidenceIntervalStrategy,
  WilsonScoreIntervalStrategy
}

The former. It helps with automatic resolution of merge conflicts.

rdsharma26 · 2024-05-22T19:02:46Z

Left a comment and a unit test needs fixing.

rdsharma26

LGTM! Thanks @zeotuan for your continued contribution to Deequ!

* Configurable RetainCompletenessRule
* Add doc string
* Add default completeness const
* Add ConfidenceIntervalStrategy
* Add Separate Wilson and Wald Interval Test
* Add License information, Fix formatting
* Add License information
* formatting fix
* Update documentation
* Make WaldInterval the default strategy for now
* Formatting import to per line
* Separate group import to per line import

Commits referencing this pull request (four commits, one per Spark build, identical apart from the version bump and the Breeze change noted below):

* Configurable RetainCompletenessRule (#564)
  * Configurable RetainCompletenessRule
  * Add doc string
  * Add default completeness const
* Optional specification of instance name in CustomSQL analyzer metric. (#569)
  Co-authored-by: Tyler Mcdaniel
* Adding Wilson Score Confidence Interval Strategy (#567)
  * Configurable RetainCompletenessRule
  * Add doc string
  * Add default completeness const
  * Add ConfidenceIntervalStrategy
  * Add Separate Wilson and Wald Interval Test
  * Add License information, Fix formatting
  * Add License information
  * formatting fix
  * Update documentation
  * Make WaldInterval the default strategy for now
  * Formatting import to per line
  * Separate group import to per line import
* CustomAggregator (#572)
  * Add support for EntityTypes dqdl rule
  * Add support for Conditional Aggregation Analyzer
  Co-authored-by: Joshua Zexter
* fix typo (#574)
* Fix performance of building row-level results (#577)
  * Generate row-level results with withColumns. Iteratively using withColumn (singular) causes performance issues when iterating over a large sequence of columns.
  * Add back UNIQUENESS_ID
* Replace 'withColumns' with 'select' (#582). 'withColumns' was introduced in Spark 3.3, so it won't work for Deequ's <3.3 builds.
* Replace rdd with dataframe functions in Histogram analyzer (#586)
  Co-authored-by: Shriya Vanvari
* Match Breeze version with spark 3.3 (#562) (2.0.8-spark-3.3 commit only)
* Updated version in pom.xml to 2.0.8-spark-3.4, 2.0.8-spark-3.3, 2.0.8-spark-3.2, and 2.0.8-spark-3.1 respectively

Co-authored-by: zeotuan, tylermcdaniel0, Tyler Mcdaniel, Joshua Zexter, bojackli, Josh, Shriya Vanvari

zeotuan added 5 commits April 19, 2024 11:06

Configurable RetainCompletenessRule

3b41e4c

Add doc string

ac337ea

Add default completeness const

db9b764

Add ConfidenceIntervalStrategy

91b1728

resolve master conflict

f096cd6

zeotuan changed the title ~~Tpm/interval strategy~~ Adding Wilson Score Confidence Interval Strategy May 6, 2024

Add Separate Wilson and Wald Interval Test

8cbffcd

zeotuan commented May 13, 2024

zeotuan added 2 commits May 16, 2024 17:38

Add License information, Fix formatting

3a9916f

Add License information

d27cb9b

zeotuan added 2 commits May 20, 2024 13:35

formatting fix

3f849f8

Update documentation

71d6e3f

Make WaldInterval the default strategy for now

387ab81

rdsharma26 reviewed May 22, 2024

zeotuan added 2 commits May 23, 2024 13:35

Formatting import to per line

913c795

Separate group import to per line import

bfe2c78

rdsharma26 approved these changes May 24, 2024

rdsharma26 merged commit 101142e into awslabs:master May 24, 2024
1 check passed
